Data for Dogs and Cats: Using Modern Methods on the Age Old Issue of Animal Homelessness

by Kyle Weston

Executive Summary

Purpose

The purpose of this tutorial is to demonstrate that data collection and analysis can provide better outcomes and significantly increase the welfare of animals within shelters. Basic data collection in shelters is common but not used to its full potential. This tutorial takes shelter data from the Austin Animal Center and uses simple information collected from each animal to generate new insights that can inform shelter policies. These insights include which animals may be avoided by adopters, geographic information on where animals are found, and which factors may matter most in getting an animal adopted into a loving home: the ultimate goal of any animal shelter.

While many shelters do keep some data on the animals that pass through them for reporting purposes, they rarely use this data analytically. This may be due to poor data quality and inconsistent record keeping, or a lack of software and expertise that keeps shelters from considering this data for analysis. Perhaps the most common reasons, though, are inertia and limited resources. Many shelters spend their scant money and time on animal care, leaving few resources for data science, which is often an expensive and time-consuming prospect.

Although many shelters would not consider data science a priority, such analysis can in fact reduce costs and increase animal welfare by redirecting resources to where they are needed. Analysis of animal data can also give a shelter metrics of progress. Even a simple number can allow the shelter to understand where its strong and weak points lie, whether in bringing in fosters or in maintaining the necessary living conditions for certain types of animals. Finally, data gives a clear empirical account of what is happening within a non-profit shelter. Such information can be an important tool for motivating executive officers to action and provides transparency, which is vital in the NPO sector.

Getting the Data

The dataset consists of shelter data from the Austin Animal Center. It comprises two separate tables: one with each animal's initial condition upon being received by the shelter and one with each animal's outcome. The data is live and updated daily.

Basic Exploration

Most of the columns are self-explanatory. Note that we have two datetime columns that appear identical.

Outcome types include:

The duplicated datetime columns in both datasets are exactly the same. Let's remove them from each.

Oddly, there are slightly more outcomes than intakes. Shouldn't it be the other way around?

It also appears that some observations are entirely duplicated. Let's fix that.
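The two cleanup steps above can be sketched in pandas as follows; the tiny frame and the column names (`monthyear` standing in for the second datetime column) are assumptions for illustration, not the real export.

```python
import pandas as pd

# Toy intake table; 'monthyear' mirrors the second datetime column that
# exactly duplicates 'datetime' in the real export (names are assumptions).
intakes = pd.DataFrame({
    "animal_id": ["A001", "A001", "A002"],
    "datetime": ["01/01/2019 10:00", "01/01/2019 10:00", "02/01/2019 09:30"],
    "monthyear": ["01/01/2019 10:00", "01/01/2019 10:00", "02/01/2019 09:30"],
    "intake_type": ["Stray", "Stray", "Owner Surrender"],
})

# Confirm the columns really are identical before dropping one of them.
assert intakes["datetime"].equals(intakes["monthyear"])
intakes = intakes.drop(columns=["monthyear"])

# Remove rows that duplicate another observation entirely.
intakes = intakes.drop_duplicates().reset_index(drop=True)
```

The same two calls are applied to the outcomes table.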

While we have taken out duplicated observations, there are duplicated ids. This is because some animals have passed through the shelter multiple times.

Ideally, for every time an animal has stayed at and left the shelter, we want one observation in our dataset. This means we will have to eliminate observations for animals that are still in the shelter (an intake with no outcome) as well as observations for animals that were not properly tagged upon intake (an outcome with no recorded intake). To achieve this, we first need to do more cleanup on our data.

Let's convert our datetime strings into something more usable. This will be necessary to compare and join our observations across the two tables.

We will also need to sort these dates for the merge of our two datasets.

We'll do an exact merge on the animal_id column and a closest-match merge on datetime. This will correctly pair up animals that have gone through the shelter multiple times. The forward direction specified is important: it matches each intake with the closest outcome that follows it.

Our merge should match the appropriate outcome (if there is one) with the corresponding intake. Let's see if it worked by comparing Champ’s intakes and outcomes below with our resulting dataset.

Looks good! We now have both the intake of an animal and its corresponding outcome paired in the same row.

However, we also have some intake observations that did not match with an outcome. Using dropna() we can remove these observations that tell us nothing about the result or duration of an animal's stay.

The merge operation renames columns that appear in both datasets. We will now clear redundant info in an observation and rename columns to something more meaningful.

Let's also check for NA observations.

Interestingly, it seems that the age of euthanized animals was not logged in the outcomes table by the shelter.

There are a few animals without outcome types. Let's remove them from our data as well.

Let's drop Diego and assume that Magna remained female and spayed.

We also need to convert the date of birth column to datetime for a later analysis.

Since we have the animal’s date of birth, the 'age_upon_intake' and 'age_upon_outcome' columns are redundant information. Let's get rid of them.

Initial Plotting

Now that our dataframe is clean and we have our paired observations, we will perform some visual exploration of the data. Most of the features of the resulting dataset are categorical in nature. We will explore the relative size of these categories using bar plots and provide simple analysis to understand what features may be useful, interesting, or surprising.

Mostly cats and dogs pass through the shelter, as expected. A smaller number of birds have been hosted by the shelter, and even some livestock. Our future analysis will focus mostly on dogs and cats, which dwarf the other observations in the set and use the most shelter resources.

What about the categories for breed?

There's a huge range of breeds in this dataset. Let’s look only at the top 20 or so.

Looks like breed follows a power-law distribution, with a handful of very popular breeds and an extremely long tail of niche, eclectic mixed breeds. The Domestic Shorthair Mix seems to be a catch-all for most cats that come into the shelter. It is followed by some easily identified dog breeds: the Pit Bull, Lab, and Chihuahua. Strangely, bat and "bat mix" take some of the next top spots.

What about intake type and condition?

The most common intake type is strays, followed by owner surrenders. This information isn't too surprising. Luckily, the Abandoned category seems fairly small relative to the others.

Condition shows another exponential-looking distribution, this time heavily weighted toward normal intakes. Most categories are self-explanatory; panleuk refers to a virus that infects cats, and agonal refers to dying animals. Thankfully, these are the smallest categories.

It appears there are slightly more male animals coming into the shelter than female animals. Furthermore, the relative amounts for intact versus neutered or spayed animals switch positions upon outcome. Let's look a little more into the number of animals that are fixed by the shelter.

Most animals are adopted out or transferred. Rarely, a few animals are lost or stolen. The Austin animal shelter is a "no-kill" shelter and maintains a "no-kill" policy that is discussed below. Let's look at their rate of euthanasia.

There are also some more detailed subtypes for outcomes. Let's quickly graph those as well.

Subtypes are dominated by transfers to partnering shelters, although there are some animal fosters as well. Overall, subtype will not be particularly useful for this analysis.

Let's also look at animal colors. This info can be an important factor in considering animal adoptions as we will explore below.

This feature also contains an overwhelming number of categories. We will again limit our graph, this time to the top 35 color categories.

This feature is fairly loosely categorized with 622 different colors. Black, White, and Brown keep the top spots.

Okay, now that we've looked at most of the features, let's look at some time series information.

Time Series

There hasn't been a significant trend across years. While 2013 appears much lower, this is because the Austin shelter's data collection began near the end of that year. The other years are relatively uniform, except for the clear influence of the pandemic, which has since lowered intakes to almost half of their previous values.

There is a seasonal pattern here. Intakes and outcomes occur more in the warmer months and drop off at the end/beginning of the year.

Analyzing Geographic Data

The dataset contains data on where each animal was found. Unfortunately, this information is encoded in plain English as either an address or a general location. As in most data science work, this information is of little use in non-numerical form. Fortunately, there are numerous APIs that support geocoding. These services can return latitude and longitude from a plain-text address or other written location. However, they are severely limited by the number of requests one can make; the industry standard is around 2,500 free requests per month. We will be using the U.S. Census Bureau's bulk geocoder, as it allows up to 10,000. More information on the geocoder is below.

First, we will randomly sample 9000 observations from our data frame to fit under the limit for a single request. We will need to do some cleaning of the provided address to make it suitable for the geocoder. Therefore, we will only sample observations that contain '(TX)' in their address field. This is the standard way the shelter has encoded this location data, but some observations are encoded incorrectly.

We will also grab the indices that each sample had in the original dataset. We will need this info to restitch the data we get back from the geocoder with our original dataset.

The geocoder requires that the location data given consists of street address with the option of including either city and state or zip code. Since we do not have a zip code, we will need to use city and state info. All of this information is encoded within the original dataset as a single string. We can extract the relevant portions by using a regex.

Only one observation doesn't match the regex, because it does not have "in" within the address field. We will delete this from our sample.

Now that the data is in the appropriate format, we can save it as a csv and upload it to the geocoder.

The geocoder can be accessed from a browser, where a csv can be uploaded manually. Its webpage on the Census Bureau's website is located here: https://geocoding.geo.census.gov/geocoder/locations/addressbatch?form. While it does match fairly well, it often does not have an existing address in its database; in practice it found a match about 60% of the time. Despite the missed matches, it still outweighs many options due to the large number of requests it allows at one time, and it is completely free!

Now let's get our resulting data.

Now, we will create a dataframe with our new lat/long data appended to our original dataframe. This lets us see not only where animals are being found but which animals they are.

Heatmap

Using purely the geographic info, we can create a heatmap of the locations of found animals. We will be using the folium library's HeatMap plugin: we create a new dataframe from our geographical data and pass it into the map.

Finally, we can create a heatmap.

This heatmap weights each observation the same. However, animal types or breeds that are at significant risk, or that pose a danger to the environment, can be weighted more heavily to redirect shelter staff or volunteers to "hotter" areas. Heatmaps using shelter data can also map and help track specific populations of homeless animals and support shelters in increasing the welfare of animals across their city and county.

Intake Plot

Let's create another map. This time we will focus on type of intake. These include stray animals, those surrendered by owner, wildlife, public assistance, and abandoned or euthanasia requests. We will define a function that will use these types to determine color for markers then plot them on a map.

The vast majority of points in our sample are stray animals. However, the red strip to the west shows locations where wildlife prefers to roam or, at least, where it comes into human contact. There is also a slight clustering of public assistance cases in downtown Austin. However, this may be a result of geographical data being tagged "Austin, Texas" with no additional address, thus defaulting to the center of the city. A shelter can use this plot to find the locations or communities most in need of public assistance, or populations of wild animals. In either case, such a map can inform a shelter about potential problems and enable it to act on their source.

From the maps above we can see that geographic information can give a significant edge to any animal shelter, particularly those looking to provide homes to stray animals. Being aware of where stray animals congregate can allow a non-profit to significantly reduce the number of companion animals suffering on the streets. This data can inform programs such as spay/neuter release (SNR) for cats, which do not rehome strays but reduce the number of homeless kittens and severely limit population growth.

This location information can also apply beyond stray animals. For example, is a specific community having issues with keeping animals, or producing a large number of owner surrenders? A low-cost food bank for animals or other programs can target the systemic issues that keep owners from being able to care for their companions. This can give better outcomes to animals and owners alike at a fraction of the cost of housing these animals in a shelter.

Ultimately, geographic information should always be considered by a shelter. Many shelters rely significantly on the communities around them for support in their operations. Knowing what is happening, and where, within a community is vital. Analyzing where animals are coming from can give significant insight into the reasons animals are finding their way into a shelter. Targeting these causes directly can allow a non-profit to increase its ability to be a force for positive change and do so more efficiently than before.

Length of Shelter Stay Analysis

Let's take a look at the distribution of the time that an animal stays at the shelter. This will be our most important metric in determining information relating to adoptions. Length of stay can tell us which animals are adopted quickly and which may be considered less desirable by adopters.

Let’s add this to our dataframe as well. Knowing how long an animal stays at the shelter can give us a significant amount of information.

What about the animal that was in the shelter the longest?

Looks like Patches had a hard time but was eventually released to a foster.

According to Texas Monthly, Patches was eventually adopted by her foster parent last year! More info here: https://www.texasmonthly.com/being-texan/austin-dog-patches-adopted-after-waiting-1913-days/

The distribution appears to be largely exponential with a relatively large lambda value. However, there also seems to be a cluster around the 5-day mark. Let’s look at the distribution of adopted and non-injured pets specifically to see if this affects our distribution of an animal's shelter stay.

Filtering for adoptions only gives us a distribution that looks more "normally" centered around a mean value. The other outcome types likely represent different processes that resolve quickly, which is why they appear in our original distribution but not this one. For example, dying or euthanized animals may be at the shelter for hours rather than the days typical of adopted animals, and animals returned to their owners or transferred to another facility may only be held by the shelter for a short time. Filtering for adopted animals gives more relevant information on an average animal's length of stay, and allows us to focus on improving the traditional shelter adoption process without factors that are not always under a shelter's control, such as sick or dying intakes. From the shelter's point of view, we want to maximize adoptions, so focusing on adoption time gives us data with practical value for improving the adoption process.

"Black Dog Syndrome"

One anecdotal observation of animal shelter staff is that black dogs are adopted less often than those with lighter coats. This may be due to the perception among adopters that an animal with a darker coat is more aggressive. A study of adoption and euthanasia found, using empirical evidence, that Black Dog Syndrome was a real factor in Sacramento County animal adoptions. The study can be found here: https://web.archive.org/web/20100401052756/http://www.animalsandsociety.org/assets/library/78_jaawsleeper.pdf

The 2002 study uses logistic regression to determine differences between euthanasia and adoption between groups of animals. However, the Austin Animal Shelter maintains a "no-kill" policy. According to a press release from the shelter (https://www.austintexas.gov/news/austin-animal-center-has-no-more-space-asks-community-help), "No-kill shelters strive to only euthanize animals who are irremediably suffering or pose a significant public safety threat" and that "Austin Animal Center is required by city ordinance to meet or exceed a 95% live outcome rate." Yet, we note that the overall kill rate is higher than 5% as demonstrated above in the initial plots sections.

The original study used logistic regression on animals euthanized. However, Austin's "no-kill" policy makes using logistic regression less desirable here as euthanasia is not necessarily the result of an animal that has been at the shelter a long time. As a result, we will focus on length of stay within the shelter to determine whether an animal is desired by potential adopters. This metric was also used by another study of two New York "no-kill" shelters that can be found here: https://www.tandfonline.com/doi/abs/10.1080/10888705.2013.740967.

Our analysis assumes that these animals were up for adoption throughout their stay at the shelter. We also seek to correct an error in the methodology of the original Sacramento study, which assumed that animals were euthanized if they were "not adopted." However, the study does not tell us whether the shelter euthanized animals after a fixed period of remaining un-adopted, or whether other factors led to euthanasia. For example, an animal that appeared un-adoptable or more aggressive, as determined by shelter staff, may have been euthanized before other animals. As a result, the original findings may reflect biases in the shelter's own selection process rather than the behavior of those adopting from the shelter.

Let’s first consider only dogs that were adopted and that had normal status upon arrival. This will filter out cases of injury that may prolong the length of time an animal stayed at the shelter but not the time that it was available for adoption.  

Now, we will separate the observations of dogs that have a pure black coat and all others.

Hmm, seems the summary for these distinct sets of observations appear nearly the same. However, many of the colors within the non-black category may have black patterns or otherwise. For example, a Black/White dog could have an almost entirely black coat and be considered a black dog. Indeed, there are actually labels for both Black/White and White/Black. Assuming that the first color is the dominant one in the animal's coat, let's see if we can further refine our data.

Even considering dogs that are partially black, we get a similar distribution between both sets. In fact, the lighter coats actually have a larger mean stay in this case. However, this seems to be mostly the result of outliers, as the medians of the two sets are within 2 hours of one another.

The distributions look very similar in all respects except the number of observations. Let's perform a Mann-Whitney U rank test to see if there is a significant difference here (just to be sure). Since we are only interested in whether dogs with primarily black coats stay longer than their counterparts, we will use a one-sided upper test. We are using the Mann-Whitney U rank test because, based on the plots above, we cannot confidently assume that the distributions are normal.

Due to the very large p-value, we fail to reject the null hypothesis; there is no evidence that dogs with primarily black coats stay in the shelter longer than their counterparts with lighter coats. Therefore, we cannot conclude that those who adopt from the Austin shelter show any prejudice against black dogs or dogs with darker coats.

While we do not see any preference against black dogs within Austin, it is hard to say whether this result generalizes, even to the rest of the United States. Some of the underlying theories involve cultural factors, meaning this data probably does not support inference about the behavior of adopters in other nations.

Similarly, cats have their own "Black Cat Syndrome," as they may be associated with witchcraft and supernatural events, and some cultures regard them as bad luck: https://www.history.com/news/black-cats-superstitions

Let's perform the same test procedure again, this time for cats within the shelter.

There is a little more difference for darker coats here than before. However, the ratio of black cats to total cats is smaller than that of black dogs to total dogs. Also, note the significant difference in turnaround time between cats and dogs: a median of around 8 days for dogs versus 25+ for cats. Both the median and mean for black cats are around 5 days higher than for cats without a black coat, at 30 versus 24 days and 43 versus 38 days respectively.

The distributions once again appear similar in shape. However, we again cannot assume a normal distribution for either of these two sets, so we will perform another Mann-Whitney U rank test to determine whether the difference noted above is significant.

With a p-value that is essentially zero, we conclude that the stays of cats with darker coats are significantly longer than those of other cats. While we cannot conclude that this is directly the result of adopters purposefully avoiding cats with dark coats, the conclusiveness of the test tells us that this result is almost certainly not due to random noise. Thus, there does seem to be a sort of Black Cat Syndrome that influences length of stay within the shelter. Even though we cannot establish it definitively, adopters avoiding black cats is a reasonable explanation for the difference we see in this data.

Using this information, a shelter can factor this bias into its adoption process. This could mean making black cats more visible to adopters visiting the shelter, creating material that dispels the myths or biases adopters hold against black cats, or promoting black cats in animal spotlights and other communications. Knowing is half the battle here, and having empirical evidence of a potential bias enables a shelter to act to confront it. In general, hypothesis testing can allow a shelter or other non-profit to establish a clear ground truth. This can pave the way for policy that targets issues at their roots.

Predicting Length of Stay with Regression Trees

For a shelter, predicting which animals are most coveted by adopters and those that are less desirable can allow it to refocus its energy, time, and expertise to adopt out animals that may otherwise be passed over or not considered. Such a strategy can save the vital and often insufficient resources of a shelter by reducing the number of animals with long and costly shelter stays.

Some of the potential factors in adoption are easy to predict. The age of the animal, its species, and its temperament are factors that many adopters have in mind before they even visit a shelter. Some of these are fairly obvious and considered by shelters on a non-empirical basis; age is a clear example, as most shelters are aware in some capacity that older pets are often passed over by adopters in favor of younger animals.

We will use a regression decision tree to demonstrate how even simple machine learning models can provide significant information when it comes to determining potential animal outcomes.

This will require several adjustments to our dataframe, including dropping redundant information and re-encoding the remaining features so they can be understood by our regression tree.

Again, we will only consider the traditional adoption process to make things simple.

We need to convert our datetime values to floats to be read by the model. Let's start with the dates by defining a converting function.

Now let’s convert the sex_upon_intake and sex_upon_outcome into was_spayed and sex columns to make them binary properties for our tree.

Let's also remove the few unknown sexes in the data.

Now, let's remove the sex columns.
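The three steps above can be sketched together as follows; the value strings mimic the shelter's "Intact Male" / "Spayed Female" style, and the derived column names (`was_fixed`, `sex`) are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "sex_upon_intake": ["Intact Male", "Spayed Female", "Neutered Male", "Unknown"],
})

# Drop the few animals whose sex was never recorded.
df = df[df["sex_upon_intake"] != "Unknown"].copy()

# Split "Intact Male"-style strings into two binary features.
parts = df["sex_upon_intake"].str.split(expand=True)
df["was_fixed"] = (parts[0] != "Intact").astype(int)  # spayed/neutered = 1
df["sex"] = (parts[1] == "Female").astype(int)        # female = 1

# The original string column is no longer needed.
df = df.drop(columns=["sex_upon_intake"])
```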

The intake condition column also skews our data significantly. Animals that are injured, or that otherwise have a condition that is not normal, will potentially go through a different process prior to being adopted. This may include receiving medical care, which subverts the traditional adoption pipeline as explored above. Let's remove this column as well.

The release_date column also needs to be removed, as the model could use it to "cheat" and determine the length of stay, which is what we want to predict.

Our remaining variables are categorical. Breed and color, however, have a massive number of possible values, so we will not be able to one-hot encode these features in the traditional way without suffering significant issues due to their high cardinality.

Instead, we will use leave one out encoding as it is referenced here: https://innovation.alteryx.com/encode-smarter/. More information on Owen Zhang's leave one out encoding strategy is also available here: https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding. The main attribute of this encoding strategy is its ability to encode categories with very large cardinality which is what we have present in our data.

Now that we have a tree, we can visualize which features it is using to make decisions. Unfortunately, the visualization is not too helpful or intuitive due to our use of a regression tree and our encoded features, but it does give us some insight into which features of the data are most important to split on initially.

Let's see what attributes are most important when considering how long an animal will stay at the shelter (according to the tree).

The decision tree determines that sex and whether the animal was fixed are fairly important features for its initial splits. Does this tell us that the sex of the animal is most important for determining its length of stay?

Not really, as there doesn't appear to be any significant difference between the lengths of stay of female and male animals in the data. It is likely that the tree considers these features more important due to their binary nature, as opposed to the other features, which have high cardinality.

Looks like a fairly high coefficient of determination (R^2) for both sets. The test data scores slightly lower on this metric, as expected.

Let's predict a few values from the test set as well to give us an idea of how an animal shelter such as the Austin Animal Center can use this data to predict the length of stay of an incoming animal.

The predictions are fairly close!

This regression tree is a fairly simple model in the machine learning space. However, even simple techniques can give relatively large insights into the animals that shelters work to rehabilitate and rehome. Many of these facilities see thousands of animals pass through their doors annually, giving machine learning models a wealth of information for predicting future animal outcomes. Sheltering information also allows these data-driven models, which are used extensively in business and money-making applications, to have a purely positive impact by helping to save animal lives.

Conclusion

Hopefully this tutorial has made clear the numerous dimensions of insight that data analysis can provide to animal shelters, from factual determination using hypothesis testing, to mapping geographical data using geocoding services, to simple machine learning algorithms that can effectively estimate what an animal's stay in the shelter may look like. In this analysis, we were able to determine, from relatively simple data, that "Black Cat Syndrome" was a factor in adoptions from the shelter while "Black Dog Syndrome" was not. We also mapped the geographical distribution of animals brought into the shelter using a heatmap and plots of different intake types. Finally, we found that a simple decision tree can be an effective way to predict animal outcomes.

Even this simple information can make a world of difference for an animal shelter. Having a reasonable estimate of how long an animal will stay allows such a non-profit to allocate and find the necessary resources, or save those resources for when they are really needed. Work in the animal sheltering industry is often time-consuming, costly, and sometimes disheartening. Using information that is already present, animal shelters can make such work more predictable, resource-efficient, and easier on shelter staff while potentially saving more animal lives in the process. The feedback on how a shelter is performing can also enable it to further refine its operations to target specific weaknesses or community issues as they appear.

Tech has been tried in almost all aspects of improving human lives, and data is now collected in nearly every industry, from estimating driving times to treating illnesses. However, using tech and data to improve the lives of our furry companions is not yet as widespread. Through techniques like these, we hope that many will leverage the increasingly familiar power of data science and use it to make real improvements in the lives of animals. With adoption of such methods, we can open the door to a new era that makes finding homes for companion animals an efficient and worry-free prospect.

Additional References:

https://towardsdatascience.com/saving-animal-lives-with-data-d815c6e854eb

https://www.thesprucepets.com/black-dog-syndrome-4796374

https://towardsdatascience.com/6-ways-to-encode-features-for-machine-learning-algorithms-21593f6238b0

http://dx.doi.org/10.1080/10888705.2013.740967

https://python-visualization.github.io/folium/modules.html